[WIP] make tracing a task public so self-tracing is possible #6972
base: master
Conversation
Force-pushed from 35d68a7 to 1716248
cc @carllerche, also for ideas on how to make this pass CI without removing the test.
Force-pushed from f26e43e to d93b7cf
[PR status: I'm having some people check out this PR to see if it helps them debug some "slow task" problems. After I see the lessons learned I'll make it less WIP]
Chiming in: @arielb1 asked me to poke at this implementation in a somewhat realistic scenario where it might be useful. One such case that I've encountered is operating a service that uses a Redis cluster as a backend for distributed precision/adaptive throttling, with various multi-step state operations and varying key cardinality. Redis's single-threaded event loop and key-space partitioning make it prone to bottlenecks if code is poorly optimized. Debugging performance issues can be complex and requires a decent amount of context on Redis's underlying behaviors, since per-command logs from the Redis side typically don't include time spent waiting in the event loop queue, metrics tend to be heavily aggregated, and performance issues are difficult to trigger directly without large-scale load testing (with realistic traffic shapes). A bit of magic to systematically trace the slower futures, without a lot of manual instrumentation by end users and without introducing new bottlenecks from blanket log/metric output, would be quite handy. This change opens that door, and other libraries could then wrap their futures with self-tracing logic.

I threw together a crude simulation: I ran a Redis cluster locally (8 nodes, key space evenly distributed) and simulated a big (and fairly hot) key alongside a bunch of normal keys. My wrapper future captures a trace and dumps it, along with a human-readable key name, if the total lookup takes longer than 500ms. This specific implementation is pretty heavy-handed for the use case above compared to a plain timer plus a simple event output, but I can imagine end-user scenarios with more complex futures that contain multiple I/O calls, different function contexts, etc., where the task trace might be handy. Ariel has another open PR that will probably also make access to the backtrace more useful.

Probably the larger benefit is for libraries rather than direct use by end users. It might be nice for @arielb1 to look at how this API feels in e.g. a tower layer, which seems like a good use case for this.

My code and output are below. You'll see a subset of my bigkey calls tripping the threshold, as well as some regular keys that were routed to the same node and got stuck in line behind the bigkeys. Here is my code:
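(The code block itself didn't survive this export. For illustration only, a minimal sketch of the kind of wrapper described above might look like the following; it assumes the `Trace::root`/`Trace::capture` functions this PR makes public under `tokio::runtime::dump` (exact paths and signatures are assumptions here), plus the usual `tokio_unstable`/task-dump build cfgs.)

```rust
use std::future::{poll_fn, Future};
use std::pin::pin;
use std::time::{Duration, Instant};

use tokio::runtime::dump::Trace; // assumed public (and unstable) per this PR

/// Sketch of a self-tracing wrapper: runs `lookup`, and if the whole call
/// takes longer than `threshold`, prints the most recent trace of where it
/// was suspended, tagged with a human-readable key name.
async fn trace_if_slow<F: Future>(key: &str, threshold: Duration, lookup: F) -> F::Output {
    // Mark the wrapped future as the root of a self-trace (API name assumed).
    let mut lookup = pin!(Trace::root(lookup));
    let start = Instant::now();
    let mut last_trace = None;

    let out = poll_fn(|cx| {
        // Capture a trace around each poll (API name assumed).
        let (res, trace) = Trace::capture(|| lookup.as_mut().poll(cx));
        if res.is_pending() {
            // Remember where the lookup was last suspended.
            last_trace = Some(trace);
        }
        res
    })
    .await;

    if start.elapsed() > threshold {
        if let Some(trace) = last_trace {
            // `Trace` is assumed to be printable, like the traces from Handle::dump().
            eprintln!("slow lookup for key {key} ({:?}):\n{trace}", start.elapsed());
        }
    }
    out
}
```

Each per-key lookup future in the simulation would then be wrapped in something like `trace_if_slow(key_name, Duration::from_millis(500), lookup)`.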
And here is my output:
And then for completeness, here you can see the overlap of the bigkey and regular-key futures that were running long; all of them were routed to node 4:
I'm also going to throw in a test against a more realistic application with more complicated usage of the tokio runtime, to see if anything rattles loose.
I did some more testing in a real application: an axum server that handles user flows across a series of endpoints, spanning vending JavaScript to the client, vending inputs for a hashcash proof-of-work challenge, and then various crypto operations to validate the solution and vend an encrypted token.

I injected the self-tracing task functionality in an outer tower middleware layer, which dumps a trace if a request takes longer than 500ms. I then injected a 5% chance of a sleep at the point where we make a remote call to AWS DynamoDB. This mimics a real performance bottleneck we encountered, caused by an overlarge table, that was annoying to debug due to the lack of specific metrics wrapping that remote call at the time.

I then simulated realistic user flows that hit all endpoints in the browser, acquired tokens, etc. I didn't test at load, but I did send enough traffic to get some concurrency. I didn't see any signs of strange behavior from the executor or otherwise; it seemed to behave as expected. The stack trace was useful and pointed directly to the line of code where the sleep was added. See abbreviated output below:
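(The abbreviated output isn't reproduced in this export. For a rough idea of how the middleware side could be wired up, not necessarily how it was done in this test, here is a hedged sketch that reuses the hypothetical `trace_if_slow` helper from the earlier sketch and assumes axum 0.7's `middleware::from_fn` API:)

```rust
use std::time::Duration;

use axum::{extract::Request, middleware::Next, response::Response};

// Hypothetical outer middleware: trace any request that takes longer than
// 500ms. `trace_if_slow` is the helper from the earlier sketch in this thread.
async fn trace_slow_requests(req: Request, next: Next) -> Response {
    let path = req.uri().path().to_owned();
    trace_if_slow(&path, Duration::from_millis(500), next.run(req)).await
}

// Installed as the outermost layer, e.g.:
// let app = Router::new()
//     /* ...routes... */
//     .layer(axum::middleware::from_fn(trace_slow_requests));
```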
There is some desire to make it possible for tasks to trace themselves to discover slow wakeups, something similar to the test I added.
This PR makes some functions public (but unstable) to make that easier.
I'm currently sharing it with someone I'm working with to see whether it helps them debug their "slow task" problems. I'll make this PR less WIP after I get their feedback.
WIP: actually add docs.